Introduction

The goal of this course was to find out whether a Fitch graph, and thus the actual horizontal gene transfer (HGT) events that took place, can be inferred from a later divergence time (LDT) graph. A Fitch graph cannot be inferred from actual gene trees. With simulated gene and species trees, however, both graphs are available, so we can examine them and determine their similarities.

In this summary of our practical course we present the results we gathered over the past two weeks. We were assigned to group \(B ii\), meaning that we simulated trees without loss and with replacing HGT events. The results are presented in the same order as they appear in the course instructions.

Analysis

2.1a) Fraction of Xenologs vs. Number of Genes


Question

Is there a dependence on the size of the gene tree, i.e., the number of species and genes?


The first task consisted of simulating gene and species trees with the package asymmetree using the parameters provided in the course script (Tab. 1). From these trees we calculated LDT and Fitch graphs, also with asymmetree. The fraction of xenologs, calculated as the number of edges in the LDT graph divided by the number of edges in the Fitch graph, is plotted against the number of genes for every parameter group (\(P0\), \(…\), \(P6\)).
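
The ratio described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: both graphs are treated as undirected, edges are given as pairs of gene labels, and the function name `xenolog_fraction` as well as the toy gene names are hypothetical and not part of asymmetree.

```python
def normalize_edges(edges):
    """Return undirected edges as a set of frozensets, dropping self-loops."""
    return {frozenset(e) for e in edges if len(set(e)) == 2}


def xenolog_fraction(ldt_edges, fitch_edges):
    """Number of LDT edges divided by the number of Fitch edges.

    Assumption: both graphs are undirected. Returns None when the Fitch
    graph has no edges, because the ratio is undefined in that case.
    """
    ldt = normalize_edges(ldt_edges)
    fitch = normalize_edges(fitch_edges)
    if not fitch:
        return None
    return len(ldt) / len(fitch)


# Toy example with hypothetical gene names.
fitch = [("a1", "b1"), ("a1", "c1"), ("b1", "c1"), ("a1", "d1")]
ldt = [("a1", "b1"), ("b1", "a1")]  # the same pair listed in both directions
print(xenolog_fraction(ldt, fitch))
```

Trees without any HGT event have no Fitch edges, so they need to be handled separately (here by returning `None`) before plotting.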

Tab. 1: Dependency of the fraction of xenologs on the number of genes. For each group, the slope and intercept of a linear model as well as the Spearman correlation are shown.
Group Duplication_Rate Loss_Rate HGT_Rate Slope Intercept Spearman_Corr
P0 0.25 0.25 0.25 0.0014 0.23 0.21
P1 0.50 0.50 0.50 -0.0009 0.39 0.05
P2 0.50 0.50 1.00 -0.0012 0.48 -0.02
P3 0.50 0.50 1.50 -0.0021 0.58 -0.16
P4 1.00 1.00 0.50 -0.0011 0.40 0.04
P5 1.00 1.00 1.00 -0.0016 0.53 -0.14
P6 1.50 1.50 1.50 -0.0016 0.57 -0.24
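
The three statistics reported per group (slope, intercept and Spearman correlation) can be reproduced without external libraries. The following is a minimal sketch, not the code used for the table; the function names and the toy data are assumptions made for illustration.

```python
def average_ranks(values):
    """Rank values starting at 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def linear_fit(x, y):
    """Ordinary least squares fit y = slope * x + intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return slope, my - slope * mx


# Hypothetical toy data: number of genes vs. fraction of xenologs.
n_genes = [10, 20, 30, 40]
fraction = [0.2, 0.1, 0.4, 0.3]
slope, intercept = linear_fit(n_genes, fraction)
rho = spearman(n_genes, fraction)
print(round(slope, 4), round(intercept, 4), round(rho, 4))
```

Because the Spearman correlation only uses ranks, it is less affected than the slope by the accumulation of values at the extremes of the y-axis.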

2.1a) Plots: Fraction of Xenologs vs. Number of Genes

The following seven plots show the dependency of the Fraction of Xenologs on the Number of Genes. For each group (see Tab. 1), a separate plot was created and a linear model was fitted to extract the slope and intercept.

Seven plots, one per group (\(P0\) to \(P6\)): Fraction of Xenologs plotted against the Number of Genes. The red line represents the linear regression and the grey area its confidence interval.

\(P0\) is the only group that shows a positive trend, with a slope of \(0.0014\); all other groups show a negative slope ranging from \(-0.0009\) \((P1)\) to \(-0.0021\) \((P3)\). The Spearman correlation (SC) is most strongly positive in \(P0\) with a value of \(0.21\), which is still considered weak \((<0.39)\). \(P2\), \(P4\) and \(P1\) show SC values close to zero (\(-0.02\), \(0.04\) and \(0.05\), respectively). \(P5\) and \(P3\) show slightly stronger negative SC values of \(-0.14\) and \(-0.16\), respectively, and \(P6\) has the strongest correlation in absolute terms with a value of \(-0.24\) (Tab. 1).

All groups show only small slopes and weak correlation coefficients, which makes the results rather inconclusive. We therefore assume that the number of genes has very little impact on the Fraction of Xenologs. The data are unevenly distributed, especially on the y-axis, where the values accumulate strongly at \(0\) and \(1\).

2.1a) Fraction of Xenologs vs. Number of Species

2.1a) Plots: Fraction of Xenologs vs. Number of Species

During the simulation, each tree was given a random maximum number of species between \(10\) and \(50\). The following seven plots show the dependency of the Fraction of Xenologs on the Number of Species. For each group (see Tab. 2), a separate plot was created and a linear model was fitted to obtain the slope and intercept.

Seven plots, one per group (\(P0\) to \(P6\)): Scatterplot of the Fraction of Xenologs plotted against the Number of Species, including a linear model (red line) with its confidence interval (grey).

Tab. 2: Dependency of the fraction of xenologs on the number of species. For each group, the slope and intercept of a linear model as well as the Spearman correlation are shown.
Group Duplication_Rate Loss_Rate HGT_Rate Slope Intercept Spearman_Corr
P0 0.25 0.25 0.25 0.0015 0.23 0.09
P1 0.50 0.50 0.50 0.0004 0.34 0.04
P2 0.50 0.50 1.00 0.0004 0.42 0.02
P3 0.50 0.50 1.50 -0.0022 0.55 -0.09
P4 1.00 1.00 0.50 0.0002 0.34 0.02
P5 1.00 1.00 1.00 -0.0016 0.49 -0.04
P6 1.50 1.50 1.50 0.0001 0.46 0.00

All groups except \(P3\) and \(P5\) show a positive slope for the Fraction of Xenologs as a function of the Number of Species. The results for each group are listed in Tab. 2. The Spearman correlation coefficients range from \(-0.09\) to \(0.09\).

We found only very weak correlations in our data and conclude that there is no substantial relationship between the number of species and the fraction of xenologs. The only notable feature was the polarity of the data: the values of the Fraction of Xenologs accumulate at \(0\) and \(1\), with the majority at \(0\).

2.1b) Fraction of Xenologs with a fixed HGT Rate


Question

How does the fraction depend on the rate of duplications and losses for a fixed horizontal transfer rate?


As shown in Tab. 1 and Tab. 2, the duplication rate and the loss rate are identical within each simulation group. We therefore combine them into a single factor.

Plot: Duplication / Loss Rate

In Figure 15 the Fraction of Xenologs is plotted against the Duplication/Loss Rate with a fixed Horizontal Gene Transfer (HGT) Rate. In addition to the box-and-whisker plot, the value for each simulated tree is plotted. Since we grouped our values by the Duplication Rate, each boxplot contains values from simulated trees with different HGT rates.

Fig. 15: Boxplot of the Fraction of Xenologs plotted against the duplication rate with a fixed horizontal gene transfer (HGT) rate. The colors mark the groups with the same HGT rate.

The Fraction of Xenologs increases with increasing Duplication Rate, although a Duplication Rate of \(1.0\) shows a lower amount of HGT events than rates of \(0.5\) and \(1.5\).

Overall, the Duplication and Loss Rates have only a minor impact on the Fraction of Xenologs. Only at the lowest Duplication/Loss Rate (\(0.25\)) is the Fraction of Xenologs lower than at the higher Duplication/Loss Rates (\(0.5-1.5\)).

2.1c) Fraction of Xenologs with fixed Loss Rate


Question

How does the fraction depend on the horizontal transfer rate with a fixed duplication and loss rate?


As shown in Tab. 1 and Tab. 2, the duplication rate and the loss rate are identical within each simulation group. We therefore combine them into a single factor.

2.1c) Plots: Fraction vs. HGT with fixed Loss

In Figure 16 the Fraction of Xenologs is plotted against the Horizontal Gene Transfer (HGT) Rate with a fixed Duplication/Loss Rate. In addition to the box-and-whisker plot, the value for each simulated tree is plotted. Since we grouped our values by the HGT Rate, each boxplot contains values from simulated trees with different duplication rates.

Fig. 16: Boxplot of the Fraction of Xenologs plotted against the HGT rate with a fixed duplication and loss rate. The colors mark the groups with the same duplication or loss rate.

The Fraction of Xenologs increases with increasing HGT rate. This is expected, as the HGT rate determines the number of horizontal gene transfers, which directly influences the number of edges in the graphs.

2.1d) Fraction of Xenologs vs. Multifurcations


Question

How does the fraction depend on the frequency of multifurcations?


2.1d Plot Fraction vs. Multifurcations

In Figure 17 the fraction of xenologs is plotted against the parameter multifurcation probability. This parameter randomly takes values between \(0\) and \(0.5\) and for reasons of clarity, we binned the data. The multifurcation rate determines the probability at which an inner node has more than 2 children.
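
The binning step can be sketched as follows. This is an illustrative sketch only: the bin width of \(0.1\) is taken from the figure caption, while the function names, the handling of the upper boundary and the toy records are assumptions.

```python
from collections import defaultdict
from statistics import median


def bin_label(p, width=0.1, upper=0.5):
    """Map a probability in [0, upper] to its bin, e.g. 0.23 -> '0.2-0.3'."""
    n_bins = round(upper / width)
    # The small epsilon keeps values such as 0.3 from falling into the
    # lower bin because of floating-point rounding; p == upper is put
    # into the last bin.
    index = min(int(p / width + 1e-9), n_bins - 1)
    return f"{index * width:.1f}-{(index + 1) * width:.1f}"


def median_by_bin(records):
    """Group (probability, fraction) pairs by bin and take the median."""
    groups = defaultdict(list)
    for prob, fraction in records:
        groups[bin_label(prob)].append(fraction)
    return {label: median(vals) for label, vals in sorted(groups.items())}


# Hypothetical (multifurcation probability, fraction of xenologs) pairs.
records = [(0.05, 0.4), (0.08, 0.6), (0.23, 0.3), (0.45, 0.1), (0.5, 0.3)]
print(median_by_bin(records))
```

The same grouping yields the quartiles needed for a boxplot if `median` is replaced by a quantile function.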

Figure 17: Boxplot of Fraction of Xenologs against multifurcation rate. The latter is binned (bin interval = \(0.1\)).

With higher Multifurcation Rates the average Fraction of Xenologs diminishes. The range of values does not change, but the median drops slightly and the \(25\)th and \(75\)th percentiles lie closer to the median. A higher Multifurcation Rate increases the number of genes, and therefore the overall Fraction of Xenologs decreases.

2.2 Fitch from LDT with CD


Question

Second we consider the dependencies for the edges in Fitch graphs computed from an LDT graph. Here the following variants should be considered:

  • Complete multipartite graph obtained by solving the Cluster Deletion Problem for the complement of the LDT (see webpage).
  • The \(rs-Fitch\) graph of the scenario computed with “Algorithm 1” from Rbelow.pdf (the latter is already implemented in AsymmeTree).
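
The first variant can be illustrated with a small sketch. Cluster Deletion asks for a minimum set of edge deletions that turns a graph into a disjoint union of cliques; applied to the complement of the LDT graph, the resulting cliques become the independent sets of a complete multipartite graph. The code below uses a simple greedy heuristic instead of an exact solver, so it is not the method from the course webpage, and all function names and the toy graph are assumptions.

```python
from itertools import combinations


def complement(nodes, edges):
    """Edges of the complement graph on the given node set."""
    edge_set = {frozenset(e) for e in edges}
    return {frozenset(p) for p in combinations(nodes, 2)} - edge_set


def greedy_cliques(nodes, edges):
    """Partition the nodes into cliques of the graph (greedy heuristic)."""
    adj = {v: set() for v in nodes}
    for e in edges:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)
    unassigned = set(nodes)
    clusters = []
    while unassigned:
        # Seed with the unassigned vertex of highest remaining degree.
        seed = max(sorted(unassigned), key=lambda v: len(adj[v] & unassigned))
        clique = {seed}
        candidates = adj[seed] & unassigned
        while candidates:
            v = max(sorted(candidates), key=lambda w: len(adj[w] & candidates))
            clique.add(v)
            candidates &= adj[v]  # keep only common neighbours
        clusters.append(clique)
        unassigned -= clique
    return clusters


def multipartite_edges(clusters):
    """Edges of the complete multipartite graph with the clusters as parts."""
    edges = set()
    for a, b in combinations(clusters, 2):
        edges |= {frozenset((u, v)) for u in a for v in b}
    return edges


# Toy LDT graph on hypothetical genes: complete bipartite {a, b} vs. {c, d}.
nodes = ["a", "b", "c", "d"]
ldt = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")]
clusters = greedy_cliques(nodes, complement(nodes, ldt))
predicted = multipartite_edges(clusters)
print(clusters, len(predicted))
```

By construction every LDT edge connects two different clusters, so the predicted graph always contains the LDT graph; how many additional edges it contains depends on the quality of the clustering.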

Plots

Plots Cluster Deletion

Fig. 18: CD: Mean Recall Rate increases when more nodes are utilised. The mean Recall Rate is plotted against the percentage used of the original graph. The graphs are grouped by their parameter set.

The Recall Rate rises quickly from \(20\%\) to \(40\%\) of nodes used in all graphs and plateaus between \(60\%\) and \(100\%\). At \(20\%\) the groups deviate more strongly from one another than at higher percentages.

Fig. 19: CD: Mean Accuracy Rate decreases when more nodes are utilised. The mean accuracy rate is plotted against the percentage used of the original graph. The graphs are grouped by their parameter set.

The Accuracy Rates decrease slightly as a larger percentage of the original graph is used. At \(20\%\) all graphs show an accuracy rate of almost \(100\%\), which drops to between \(97\%\) and \(92\%\) when \(100\%\) of the nodes are used. This means that the number of erroneous classifications increases with the number of nodes.

Fig. 20: CD: Mean Precision Rate decreases when more nodes are utilised. The mean precision rate is plotted against the percentage used of the original graph. The graphs are grouped by their parameter set.

In this graph we can see two groups of lines (\(P0\), \(P1\), \(P4\) and \(P2\), \(P3\), \(P5\), \(P6\)) all of which are slightly declining with increasing percentages. It has to be noted that the decline is only minor and the deviation between groups appears to be stronger than the influence of the percentage of nodes used.
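
For reference, the three measures used in these plots can be computed from the edge sets of the true and the predicted Fitch graph. This is a minimal sketch under the assumption that edges are compared as undirected pairs and that every unordered pair of genes counts as one classification; the function name and the toy data are hypothetical.

```python
from itertools import combinations


def edge_metrics(nodes, true_edges, predicted_edges):
    """Recall, precision and accuracy of predicted vs. true edges."""
    true = {frozenset(e) for e in true_edges}
    pred = {frozenset(e) for e in predicted_edges}
    n_pairs = len(list(combinations(nodes, 2)))
    tp = len(true & pred)   # edges found in both graphs
    fp = len(pred - true)   # predicted edges that are not real
    fn = len(true - pred)   # real edges that were missed
    tn = n_pairs - tp - fp - fn
    return {
        "recall": tp / (tp + fn) if tp + fn else None,
        "precision": tp / (tp + fp) if tp + fp else None,
        "accuracy": (tp + tn) / n_pairs if n_pairs else None,
    }


# Toy example on five hypothetical genes (10 possible pairs).
nodes = ["a", "b", "c", "d", "e"]
true_fitch = [("a", "b"), ("a", "c"), ("a", "d"), ("a", "e")]
predicted = [("a", "b"), ("a", "c"), ("b", "c")]
print(edge_metrics(nodes, true_fitch, predicted))
```

Accuracy includes the true negatives, which dominate in sparse graphs; this is one reason why accuracy can stay above \(90\%\) while recall and precision vary more.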

Plots Fitch (RS)

Fig. 21: RS: Mean Recall Rate increases when more nodes are utilised. The mean recall rate is plotted against the percentage used of the original graph. The graphs are grouped by their parameter set.

The recall rate rises quickly from \(20\%\) to \(40\%\) of nodes used in all graphs and plateaus between \(60\%\) and \(100\%\). At \(20\%\) the groups deviate more strongly from one another than at higher percentages. Furthermore, the mean recall seems to be slightly higher than the recall of the Fitch graph computed with the CD algorithm. Moreover, groups with a high HGT rate (\(P2\), \(P3\), \(P5\), \(P6\)) also have a higher recall than groups with a lower HGT rate. The higher the HGT rate and the number of nodes used, the higher the number of true positive edges compared to the real Fitch graph.

Fig. 22: RS: Mean Accuracy Rate decreases when more nodes are utilised. The mean accuracy rate is plotted against the percentage used of the original graph. The graphs are grouped by their parameter set.

The accuracy rates decrease slightly as a larger percentage of the original graph is used. At \(20\%\) all graphs show an accuracy rate of almost \(100\%\), which drops to between \(97\%\) and \(92\%\) when \(100\%\) of the nodes are used. This means that the number of erroneous classifications increases with the number of nodes. Moreover, the accuracy of the RS algorithm seems to be slightly better than that of the CD algorithm, and the accuracy decreases as the HGT rate increases.

Fig. 23: RS: Mean Precision Rate decreases when more nodes are utilised. The mean precision rate is plotted against the percentage used of the original graph. The graphs are grouped by their parameter set.

In this graph we can see two groups of lines (\(P0\), \(P1\), \(P4\) and \(P2\), \(P3\), \(P5\), \(P6\)), all of which decline slightly with increasing percentages. The decline is only minor, and the influence of the group appears to be much stronger: the HGT rate seems to have a major impact on the precision (\(P0\), \(P1\), \(P4\) have the lowest HGT Rates with \(0.25\), \(0.5\) and \(0.5\)). The number of false positives predicted by the algorithm therefore decreases with increasing HGT Rate. Compared to the CD algorithm, the precision of the RS algorithm is on average \(1\%\) to \(3\%\) lower.

Triple T Fractions

3. Triples: Characterization of LDT Graph


Question

The triple set \(T (G)\) is related to the gene tree, while the triple set \(S(G, σ)\) is related to the species tree. It is therefore of interest to compare to what extent \(T (G)\) and \(S(G, σ)\) overlap the triple sets of true gene tree and the triple set of the true species tree, respectively. How can this be quantified in a meaningful way? Again we are interested in the dependence of the simulation parameters.


Triple T Fraction

Fig. 24: Fraction of species tree triple and LDT-triple depending on the simulation parameters.

Triple S Fraction

Fig. 25: Fraction of gene tree triple and LDT-triple depending on the simulation parameters.

Tab. 3: Fraction of gene/species tree triple and LDT-triple depending on the simulation parameters.
Group T_Triple_Mean T_Triple_Median S_Triple_Mean S_Triple_Median
P0 0.0542 0.0146 0.0713 0.0037
P1 0.0914 0.0431 0.1025 0.0399
P2 0.1262 0.0866 0.1782 0.1131
P3 0.1378 0.1001 0.2257 0.1781
P4 0.0854 0.0445 0.1013 0.0367
P5 0.1198 0.0796 0.1693 0.1142
P6 0.1390 0.0991 0.1963 0.1338

Figures 24 and 25 and Table 3 show that the fraction of triples that can be detected both in the LDT graph and in the gene/species trees strongly depends on the HGT Rate. The triple fraction was calculated by dividing the number of informative triples by the number of triples of the corresponding tree. The fraction increases from group \(P0\) to \(P3\) and from group \(P4\) to \(P6\) (from below \(1\%\) to almost \(20\%\)), just like the HGT parameter in the simulation conditions. The trend is more pronounced when the triples of the LDT graph are compared with the triples of the species tree S.

The other simulation parameters play a minor role, as can be seen by comparing groups \(P1\), \(P2\) and \(P3\) with groups \(P4\), \(P5\) and \(P6\). In the pairs \(P1\)/\(P4\), \(P2\)/\(P5\) and \(P3\)/\(P6\) the HGT Rate is the same, while the Loss and Duplication Rates differ. Nevertheless, the means and medians of the paired groups are very similar. This is due to two reasons:

1. To extract informative triples from the LDT graph, the LDT graph needs edges (i.e. HGT events in the trees). With more edges, more triples can be extracted from the LDT graph. That is why the fraction of triples that can be detected in both LDT graph and tree increases with the HGT Rate.
2. We were working with replacing HGT in our group, which did not add new genes through HGT events. Thus, the number of triples of the gene tree also did not increase due to HGT events, which could have negated the effect of the new edges in the LDT graph. A different trend in the fraction between gene tree and LDT triples should therefore be seen in the groups with additive HGT.

One reason why the trend in Figure Y (fractions of species tree and LDT triples) is slightly stronger than in Figure X could be that two edges are needed for a species tree triple in the LDT graph instead of just one.
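
The gene tree part of this calculation can be sketched as follows. The sketch rests on two assumptions that are not spelled out in this report: that \(T (G)\) consists of all triples \(xy|z\) for which \(xy\) is an edge of the LDT graph while \(xz\) and \(yz\) are not, and that the fraction counts the LDT triples that are also displayed by the tree. Trees are represented as nested tuples, and all names and the toy data are hypothetical.

```python
from itertools import combinations


def leaves_and_triples(tree):
    """Return (leaf list, set of displayed triples) of a nested-tuple tree.

    A triple xy|z is stored as (frozenset({x, y}), z).
    """
    if not isinstance(tree, tuple):
        return [tree], set()
    child_leaves, triples = [], set()
    for child in tree:
        lv, tr = leaves_and_triples(child)
        child_leaves.append(lv)
        triples |= tr
    # x and y below the same child, z below a different child of this node.
    for i, inside in enumerate(child_leaves):
        outside = [z for j, lv in enumerate(child_leaves) if j != i for z in lv]
        for x, y in combinations(inside, 2):
            for z in outside:
                triples.add((frozenset((x, y)), z))
    return [x for lv in child_leaves for x in lv], triples


def graph_triples(nodes, edges):
    """Triples xy|z with xy an edge and xz, yz non-edges (assumed T(G))."""
    e = {frozenset(p) for p in edges}
    out = set()
    for x, y in combinations(nodes, 2):
        if frozenset((x, y)) not in e:
            continue
        for z in nodes:
            if z in (x, y):
                continue
            if frozenset((x, z)) not in e and frozenset((y, z)) not in e:
                out.add((frozenset((x, y)), z))
    return out


def triple_fraction(informative, tree_triples):
    """Share of the tree's triples that were recovered from the graph."""
    if not tree_triples:
        return None
    return len(informative & tree_triples) / len(tree_triples)


# Toy gene tree ((a,b),(c,d)) and an LDT graph with the single edge a-b.
leaves, tree_tr = leaves_and_triples((("a", "b"), ("c", "d")))
ldt_tr = graph_triples(leaves, [("a", "b")])
print(len(tree_tr), len(ldt_tr), triple_fraction(ldt_tr, tree_tr))
```

In this toy case a single LDT edge already yields two of the four triples of the tree, which illustrates why the fraction grows with the number of LDT edges and hence with the HGT Rate.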